In [1]:
import pandas as pd
import numpy as np
import plotly.express as plx
from plotly.subplots import make_subplots
import plotly.graph_objects as go

We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to glazing area, glazing-area distribution, and orientation, among other parameters.

We simulate various settings as functions of these characteristics, yielding 768 building shapes.

The dataset comprises 768 samples and 8 features, and the aim is to predict two real-valued responses.

It can also be used as a multi-class classification problem if the responses are rounded to the nearest integer.

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by Y1 and Y2). The aim is to use the eight features to predict each of the two responses.

X1 Relative Compactness

X2 Surface Area

X3 Wall Area

X4 Roof Area

X5 Overall Height

X6 Orientation

X7 Glazing Area

X8 Glazing Area Distribution

Y1 Heating Load

Y2 Cooling Load
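For readability, the coded columns can be mapped to descriptive names. The mapping below simply mirrors the list above; the descriptive names are our own choice, and the rest of this notebook keeps the X/Y codes. The one-row frame is only a stand-in for the real data.

```python
import pandas as pd

# Hypothetical readable names for the coded columns (the notebook itself
# keeps the X/Y codes throughout).
FEATURE_NAMES = {
    "X1": "relative_compactness",
    "X2": "surface_area",
    "X3": "wall_area",
    "X4": "roof_area",
    "X5": "overall_height",
    "X6": "orientation",
    "X7": "glazing_area",
    "X8": "glazing_area_distribution",
    "Y1": "heating_load",
    "Y2": "cooling_load",
}

# Stand-in frame with the coded columns (the real df comes from the Excel file).
coded = pd.DataFrame(
    [[0.98, 514.5, 294.0, 110.25, 7.0, 2, 0.0, 0, 15.55, 21.33]],
    columns=list(FEATURE_NAMES),
)
readable = coded.rename(columns=FEATURE_NAMES)
print(list(readable.columns))
```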

In [2]:
df = pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx')
df
Out[2]:
X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
0 0.98 514.5 294.0 110.25 7.0 2 0.0 0 15.55 21.33
1 0.98 514.5 294.0 110.25 7.0 3 0.0 0 15.55 21.33
2 0.98 514.5 294.0 110.25 7.0 4 0.0 0 15.55 21.33
3 0.98 514.5 294.0 110.25 7.0 5 0.0 0 15.55 21.33
4 0.90 563.5 318.5 122.50 7.0 2 0.0 0 20.84 28.28
... ... ... ... ... ... ... ... ... ... ...
763 0.64 784.0 343.0 220.50 3.5 5 0.4 5 17.88 21.40
764 0.62 808.5 367.5 220.50 3.5 2 0.4 5 16.54 16.88
765 0.62 808.5 367.5 220.50 3.5 3 0.4 5 16.44 17.11
766 0.62 808.5 367.5 220.50 3.5 4 0.4 5 16.48 16.61
767 0.62 808.5 367.5 220.50 3.5 5 0.4 5 16.64 16.03

768 rows × 10 columns

In [3]:
df.isnull().sum()
Out[3]:
X1    0
X2    0
X3    0
X4    0
X5    0
X6    0
X7    0
X8    0
Y1    0
Y2    0
dtype: int64

We already know from https://archive.ics.uci.edu/dataset/242/energy+efficiency that there are no missing values in the dataset.

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      768 non-null    float64
 1   X2      768 non-null    float64
 2   X3      768 non-null    float64
 3   X4      768 non-null    float64
 4   X5      768 non-null    float64
 5   X6      768 non-null    int64  
 6   X7      768 non-null    float64
 7   X8      768 non-null    int64  
 8   Y1      768 non-null    float64
 9   Y2      768 non-null    float64
dtypes: float64(8), int64(2)
memory usage: 60.1 KB
In [5]:
df.describe()
Out[5]:
X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
count 768.000000 768.000000 768.000000 768.000000 768.00000 768.000000 768.000000 768.00000 768.000000 768.000000
mean 0.764167 671.708333 318.500000 176.604167 5.25000 3.500000 0.234375 2.81250 22.307195 24.587760
std 0.105777 88.086116 43.626481 45.165950 1.75114 1.118763 0.133221 1.55096 10.090204 9.513306
min 0.620000 514.500000 245.000000 110.250000 3.50000 2.000000 0.000000 0.00000 6.010000 10.900000
25% 0.682500 606.375000 294.000000 140.875000 3.50000 2.750000 0.100000 1.75000 12.992500 15.620000
50% 0.750000 673.750000 318.500000 183.750000 5.25000 3.500000 0.250000 3.00000 18.950000 22.080000
75% 0.830000 741.125000 343.000000 220.500000 7.00000 4.250000 0.400000 4.00000 31.667500 33.132500
max 0.980000 808.500000 416.500000 220.500000 7.00000 5.000000 0.400000 5.00000 43.100000 48.030000

In this dataset, there are two output variables and 8 input variables.

The output variables (Y1 & Y2) are floats.

Six input variables (X1, X2, X3, X4, X5, X7) are floats; the remaining two (X6, X8) are integers.

Exploratory Data Analysis

In [6]:
plx.imshow(df.corr(),height=750,width=750,text_auto=True)

Insights:

From the above correlation heatmap we see that:

1. Y1 and Y2 are highly positively correlated with each other.

2. X5 (Overall Height) is highly positively correlated with both output variables (Y1 & Y2): as the overall height of the building increases, the heating and cooling loads also increase.

3. X4 (Roof Area) is highly negatively correlated with both output variables (Y1 & Y2): as the roof area increases, the heating and cooling loads decrease.

4. X1 and X3 (Relative Compactness and Wall Area) have moderate positive correlation with the output variables (Y1 & Y2).

5. X2 (Surface Area) has moderate negative correlation with the output variables (Y1 & Y2).

6. X3 (Wall Area) correlates only with the output variables (Y1 & Y2), not with the other seven variables.

7. X1, X2, X4 and X5 (Relative Compactness, Surface Area, Roof Area and Overall Height) are highly correlated with each other.

8. X7 (Glazing Area) has a correlation of around 0.2 with both output variables (Y1 & Y2) and with X8, but is essentially uncorrelated with the other six variables.

9. X6 (Orientation) is essentially uncorrelated with every variable, including the outputs (Y1 & Y2). What is orientation, and why is it unimportant here? Orientation describes how the building is positioned. One common definition: "Orientation is how a building is positioned in relation to the sun's paths in different seasons, as well as to prevailing wind patterns. In passive design, it is also about how living and sleeping areas are designed and positioned, either to take advantage of the sun and wind, or be protected from their effects". Here orientation is encoded as the numbers 2, 3, 4, 5; these codes carry meaning, but the dataset does not document it. In any case, the correlations show that Y1 and Y2 do not depend on the orientation of the building, so X6 is not important.

10. X8 (Glazing Area Distribution) is not important for the output variables (Y1 & Y2), but has a correlation of about 0.2 with X7.

11. X1 & X2 are highly negatively correlated with each other.
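The feature-to-target correlations behind these insights can be pulled out in one view with `df.corr()[['Y1', 'Y2']]`. A minimal sketch on a small synthetic frame standing in for `df` (the signs are constructed to echo insights 2 and 3):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: Y1 rises with X5 and falls with X4, Y2 tracks Y1,
# mimicking the relationships observed in the real dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["X5", "X4", "Y1", "Y2"])
demo["Y1"] = demo["X5"] - demo["X4"] + rng.normal(scale=0.1, size=100)
demo["Y2"] = demo["Y1"] + rng.normal(scale=0.1, size=100)

# Each feature's correlation with both targets, sorted by its effect on Y1.
target_corr = demo.corr()[["Y1", "Y2"]].drop(index=["Y1", "Y2"])
print(target_corr.sort_values("Y1", ascending=False))
```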

In [7]:
df.corr()['X6']
Out[7]:
X1    4.678592e-17
X2   -3.459372e-17
X3   -2.429499e-17
X4   -5.830058e-17
X5    4.492205e-17
X6    1.000000e+00
X7   -9.406007e-16
X8   -2.549352e-16
Y1   -2.586763e-03
Y2    1.428960e-02
Name: X6, dtype: float64
In [8]:
df['X6'].value_counts()
Out[8]:
2    192
3    192
4    192
5    192
Name: X6, dtype: int64
In [9]:
df['X8'].value_counts()
Out[9]:
1    144
2    144
3    144
4    144
5    144
0     48
Name: X8, dtype: int64
In [10]:
plx.box(x = df['X8'],y=df['Y1'],color=df['X8'])
In [11]:
plx.box(x = df['X8'],y=df['Y2'],color=df['X8'])

From the above two graphs, we see that X8 = 0 behaves as one group and the remaining values (1 to 5) behave as another group.
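The 0-vs-rest split can also be checked numerically by grouping on the boolean "is X8 zero". A minimal sketch on a toy frame (on the real data this would be `df.groupby(df['X8'].eq(0))[['Y1', 'Y2']].median()`):

```python
import pandas as pd

# Toy stand-in for df: X8 == 0 rows have visibly lower Y1 than X8 > 0 rows.
toy = pd.DataFrame({
    "X8": [0, 0, 1, 2, 3, 4, 5],
    "Y1": [15.0, 16.0, 25.0, 26.0, 24.0, 27.0, 25.5],
})
# Group by the boolean "is X8 zero" and compare the two group medians.
summary = toy.groupby(toy["X8"].eq(0))["Y1"].median()
print(summary)  # index True = X8 == 0 group, False = X8 > 0 group
```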

Dimensionality Reduction

In [12]:
plx.imshow(df.corr(),height=750,width=750,text_auto=True)

1. From the above heatmap, we know that X6 (Orientation) is essentially uncorrelated with every variable, including the outputs (Y1 & Y2), so we can drop X6 from the feature set.

In [13]:
df.drop(['X6'],axis=1,inplace=True)
In [14]:
df
Out[14]:
X1 X2 X3 X4 X5 X7 X8 Y1 Y2
0 0.98 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
1 0.98 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
2 0.98 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
3 0.98 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
4 0.90 563.5 318.5 122.50 7.0 0.0 0 20.84 28.28
... ... ... ... ... ... ... ... ... ...
763 0.64 784.0 343.0 220.50 3.5 0.4 5 17.88 21.40
764 0.62 808.5 367.5 220.50 3.5 0.4 5 16.54 16.88
765 0.62 808.5 367.5 220.50 3.5 0.4 5 16.44 17.11
766 0.62 808.5 367.5 220.50 3.5 0.4 5 16.48 16.61
767 0.62 808.5 367.5 220.50 3.5 0.4 5 16.64 16.03

768 rows × 9 columns

2. From the above heatmap, we know that X1 and X2 are highly negatively correlated with each other. Of such a pair, we remove the feature with the weaker correlation to the output variables.

X1 has a correlation of about 0.63 with the output variables Y1 & Y2.

X2 has a correlation of about -0.67 with the output variables Y1 & Y2.

So we remove X1 (Relative Compactness).
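The rule used here (for a highly correlated pair, drop the one with the weaker absolute correlation to the target) can be written generically. A minimal sketch on synthetic data, where X1 and X2 are nearly collinear but X2 drives Y1 directly:

```python
import numpy as np
import pandas as pd

# X2 is (almost) the negative of X1, and Y1 depends directly on X2,
# so X2 should have the stronger |correlation| with Y1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
frame = pd.DataFrame({"X1": x1, "X2": -x1 + rng.normal(scale=0.5, size=200)})
frame["Y1"] = frame["X2"] * 1.5 + rng.normal(scale=0.5, size=200)

# Absolute correlation of each candidate with the target; drop the weaker one.
corr = frame.corr()["Y1"].abs()
weaker = corr[["X1", "X2"]].idxmin()
print(f"drop {weaker}")
```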

In [15]:
df.drop(['X1'],axis=1,inplace=True)
In [16]:
df
Out[16]:
X2 X3 X4 X5 X7 X8 Y1 Y2
0 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
1 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
2 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
3 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
4 563.5 318.5 122.50 7.0 0.0 0 20.84 28.28
... ... ... ... ... ... ... ... ...
763 784.0 343.0 220.50 3.5 0.4 5 17.88 21.40
764 808.5 367.5 220.50 3.5 0.4 5 16.54 16.88
765 808.5 367.5 220.50 3.5 0.4 5 16.44 17.11
766 808.5 367.5 220.50 3.5 0.4 5 16.48 16.61
767 808.5 367.5 220.50 3.5 0.4 5 16.64 16.03

768 rows × 8 columns

We can binarize X8: 0 stays 0, and every non-zero value is mapped to 1.

In [17]:
df.loc[(df['X8']>0), 'X8'] = 1
df
Out[17]:
X2 X3 X4 X5 X7 X8 Y1 Y2
0 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
1 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
2 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
3 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
4 563.5 318.5 122.50 7.0 0.0 0 20.84 28.28
... ... ... ... ... ... ... ... ...
763 784.0 343.0 220.50 3.5 0.4 1 17.88 21.40
764 808.5 367.5 220.50 3.5 0.4 1 16.54 16.88
765 808.5 367.5 220.50 3.5 0.4 1 16.44 17.11
766 808.5 367.5 220.50 3.5 0.4 1 16.48 16.61
767 808.5 367.5 220.50 3.5 0.4 1 16.64 16.03

768 rows × 8 columns

Now the X8 values have been updated.

In [18]:
plx.imshow(df.corr(),height=750,width=750,text_auto=True)

Insight

1. After binarizing X8, its correlation with X7 and with the output variables Y1 & Y2 has increased.

In [19]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X2      768 non-null    float64
 1   X3      768 non-null    float64
 2   X4      768 non-null    float64
 3   X5      768 non-null    float64
 4   X7      768 non-null    float64
 5   X8      768 non-null    int64  
 6   Y1      768 non-null    float64
 7   Y2      768 non-null    float64
dtypes: float64(7), int64(1)
memory usage: 48.1 KB
In [20]:
plx.box(x = df['X8'],y=df['Y1'],color=df['X8'])
In [21]:
plx.box(x = df['X8'],y=df['Y2'],color=df['X8'])

Now the dataset is in good shape. We can move on to model development.

Model Development

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import r2_score
from tqdm import tqdm
In [23]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df[['Y1','Y2']]
lr_trn_score,rfr_trn_score,sgd_trn_score,en_trn_score,abr_trn_score,gbr_trn_score,svr_trn_score,xgb_trn_score,cbr_trn_score = [],[],[],[],[],[],[],[],[]
lr_test_score,rfr_test_score,sgd_test_score,en_test_score,abr_test_score,gbr_test_score,svr_test_score,xgb_test_score,cbr_test_score = [],[],[],[],[],[],[],[],[]
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
    
    lr = LinearRegression().fit(x_train, y_train)
    pred = lr.predict(x_test)
    pred_trn = lr.predict(x_train)
    lr_test_score.append(r2_score(y_test, pred))
    lr_trn_score.append(r2_score(y_train, pred_trn))
    
    sgd = MultiOutputRegressor(SGDRegressor()).fit(x_train,y_train)
    pred = sgd.predict(x_test)
    pred_trn = sgd.predict(x_train)
    sgd_test_score.append(r2_score(y_test, pred))
    sgd_trn_score.append(r2_score(y_train, pred_trn))
    
    en = ElasticNet().fit(x_train,y_train)
    pred = en.predict(x_test)
    pred_trn = en.predict(x_train)
    en_test_score.append(r2_score(y_test, pred))
    en_trn_score.append(r2_score(y_train, pred_trn))
    
    abr = MultiOutputRegressor(AdaBoostRegressor()).fit(x_train,y_train)
    pred = abr.predict(x_test)
    pred_trn = abr.predict(x_train)
    abr_test_score.append(r2_score(y_test, pred))
    abr_trn_score.append(r2_score(y_train, pred_trn))
    
    gbr = MultiOutputRegressor(GradientBoostingRegressor()).fit(x_train,y_train)
    pred = gbr.predict(x_test)
    pred_trn = gbr.predict(x_train)
    gbr_test_score.append(r2_score(y_test, pred))
    gbr_trn_score.append(r2_score(y_train, pred_trn))
    
    svr = MultiOutputRegressor(SVR()).fit(x_train,y_train)
    pred = svr.predict(x_test)
    pred_trn = svr.predict(x_train)
    svr_test_score.append(r2_score(y_test, pred))
    svr_trn_score.append(r2_score(y_train, pred_trn))
    
    xgb = MultiOutputRegressor(XGBRegressor()).fit(x_train,y_train)
    pred = xgb.predict(x_test)
    pred_trn = xgb.predict(x_train)
    xgb_test_score.append(r2_score(y_test, pred))
    xgb_trn_score.append(r2_score(y_train, pred_trn))
    
    cbr = MultiOutputRegressor(CatBoostRegressor(verbose=0)).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))
    
    
    rfr = RandomForestRegressor().fit(x_train, y_train)
    pred = rfr.predict(x_test)
    pred_trn = rfr.predict(x_train)
    rfr_test_score.append(r2_score(y_test, pred))
    rfr_trn_score.append(r2_score(y_train, pred_trn))
100%|██████████| 1000/1000 [30:07<00:00,  1.81s/it] 

1. Linear Regression

In [24]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = lr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = lr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on Linear Regression')
fig.show()

2. SGDRegressor

In [25]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = sgd_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = sgd_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on SGDRegressor')
fig.show()

3. ElasticNet

In [26]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = en_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = en_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on ElasticNet Regression')
fig.show()

4. AdaBoostRegressor

In [27]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = abr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = abr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on AdaBoostRegressor')
fig.show()

5. GradientBoostingRegressor

In [28]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = gbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = gbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on GradientBoostingRegressor')
fig.show()

6. SVR

In [29]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = svr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = svr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on Support Vector Regressor')
fig.show()

7. XGBRegressor

In [30]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = xgb_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = xgb_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on XGBRegressor')
fig.show()

8. CatBoostRegressor

In [31]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

9. RandomForestRegressor

In [32]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = rfr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = rfr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on RandomForestRegressor')
fig.show()

From the above visualizations, we see that the boosting algorithms predict better than the other algorithms: both train and test scores (r2) are good, around 0.985.
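Instead of eyeballing the 1000-point score traces, each model's list can be reduced to a mean and standard deviation and the best model picked programmatically. A minimal sketch with illustrative stand-in lists (the same aggregation applies to the real `lr_test_score`, `gbr_test_score`, etc.):

```python
import numpy as np

# Stand-ins for the per-iteration r2 lists collected in the loop above;
# the numbers here are illustrative, not the notebook's actual results.
scores = {
    "LinearRegression": [0.90, 0.91, 0.89],
    "GradientBoosting": [0.985, 0.984, 0.986],
}
for name, vals in scores.items():
    print(f"{name}: mean={np.mean(vals):.4f}, std={np.std(vals):.4f}")

# Pick the model with the highest mean test r2.
best = max(scores, key=lambda k: np.mean(scores[k]))
print(f"best: {best}")
```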

In [33]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df[['Y1','Y2']]
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = MultiOutputRegressor(CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100)).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))
0.9854413978366589 0.9814623383584027
In [34]:
y1_pred,y2_pred = [],[]
for i in range(len(pred)):
    y1_pred.append(pred[i][0])
    y2_pred.append(pred[i][1])
In [35]:
def visulaize_performance_of_the_model(pred, y_test, modelname):
    # Plotting both line & scatter plot in same graph of predicted values to check the performance of the model in visualization.
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=np.arange(0,50), y=np.arange(0,50),
                             mode='lines',
                             name='perfectline'))
    fig.add_trace(go.Scatter(x=pred, y=y_test,
                             mode='markers',
                             name='predictions'))
    fig.update_layout(
        title=f"Performance of {modelname} on Test data",
        xaxis_title="Predicted",
        yaxis_title="Actual",
        font=dict(
            family="Courier New, monospace",
            size=13,
            color="RebeccaPurple"
        )
    )
    fig.show()
In [36]:
visulaize_performance_of_the_model(y1_pred, y_test['Y1'], 'CatBoost regressor')
In [37]:
visulaize_performance_of_the_model(y2_pred, y_test['Y2'], 'CatBoost regressor')

From the above graphs, the predictions for Y1 are very good, but those for Y2 are somewhat worse. So we decided to predict Y1 and Y2 individually with two separate models. Let's find the best algorithm for each.

1. Prediction on Y1 (Heating Load)

In [38]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df['Y1']
lr_trn_score,rfr_trn_score,abr_trn_score,gbr_trn_score,xgb_trn_score,cbr_trn_score = [],[],[],[],[],[]
lr_test_score,rfr_test_score,abr_test_score,gbr_test_score,xgb_test_score,cbr_test_score = [],[],[],[],[],[]
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
    
    lr = LinearRegression().fit(x_train, y_train)
    pred = lr.predict(x_test)
    pred_trn = lr.predict(x_train)
    lr_test_score.append(r2_score(y_test, pred))
    lr_trn_score.append(r2_score(y_train, pred_trn))
    
    abr = AdaBoostRegressor().fit(x_train,y_train)
    pred = abr.predict(x_test)
    pred_trn = abr.predict(x_train)
    abr_test_score.append(r2_score(y_test, pred))
    abr_trn_score.append(r2_score(y_train, pred_trn))
    
    gbr = GradientBoostingRegressor().fit(x_train,y_train)
    pred = gbr.predict(x_test)
    pred_trn = gbr.predict(x_train)
    gbr_test_score.append(r2_score(y_test, pred))
    gbr_trn_score.append(r2_score(y_train, pred_trn))
     
    xgb = XGBRegressor().fit(x_train,y_train)
    pred = xgb.predict(x_test)
    pred_trn = xgb.predict(x_train)
    xgb_test_score.append(r2_score(y_test, pred))
    xgb_trn_score.append(r2_score(y_train, pred_trn))
    
    cbr = CatBoostRegressor(verbose=0).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))
    
    
    rfr = RandomForestRegressor().fit(x_train, y_train)
    pred = rfr.predict(x_test)
    pred_trn = rfr.predict(x_train)
    rfr_test_score.append(r2_score(y_test, pred))
    rfr_trn_score.append(r2_score(y_train, pred_trn))
100%|██████████| 1000/1000 [14:46<00:00,  1.13it/s] 

1. Linear Regression

In [39]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = lr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = lr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on Linear Regression')
fig.show()

2. AdaBoostRegressor

In [40]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = abr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = abr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on AdaBoostRegressor')
fig.show()

3. GradientBoostingRegressor

In [41]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = gbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = gbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on GradientBoostingRegressor')
fig.show()

4. XGBRegressor

In [42]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = xgb_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = xgb_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on XGBRegressor')
fig.show()

5. CatBoostRegressor

In [43]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

6. RandomForestRegressor

In [44]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = rfr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = rfr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on RandomForestRegressor')
fig.show()

2. Prediction on Y2 (Cooling Load)

In [45]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df['Y2']
lr_trn_score,rfr_trn_score,abr_trn_score,gbr_trn_score,xgb_trn_score,cbr_trn_score = [],[],[],[],[],[]
lr_test_score,rfr_test_score,abr_test_score,gbr_test_score,xgb_test_score,cbr_test_score = [],[],[],[],[],[]
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
    
    lr = LinearRegression().fit(x_train, y_train)
    pred = lr.predict(x_test)
    pred_trn = lr.predict(x_train)
    lr_test_score.append(r2_score(y_test, pred))
    lr_trn_score.append(r2_score(y_train, pred_trn))
    
    abr = AdaBoostRegressor().fit(x_train,y_train)
    pred = abr.predict(x_test)
    pred_trn = abr.predict(x_train)
    abr_test_score.append(r2_score(y_test, pred))
    abr_trn_score.append(r2_score(y_train, pred_trn))
    
    gbr = GradientBoostingRegressor().fit(x_train,y_train)
    pred = gbr.predict(x_test)
    pred_trn = gbr.predict(x_train)
    gbr_test_score.append(r2_score(y_test, pred))
    gbr_trn_score.append(r2_score(y_train, pred_trn))
     
    xgb = XGBRegressor().fit(x_train,y_train)
    pred = xgb.predict(x_test)
    pred_trn = xgb.predict(x_train)
    xgb_test_score.append(r2_score(y_test, pred))
    xgb_trn_score.append(r2_score(y_train, pred_trn))
    
    cbr = CatBoostRegressor(verbose=0).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))
    
    
    rfr = RandomForestRegressor().fit(x_train, y_train)
    pred = rfr.predict(x_test)
    pred_trn = rfr.predict(x_train)
    rfr_test_score.append(r2_score(y_test, pred))
    rfr_trn_score.append(r2_score(y_train, pred_trn))
100%|██████████| 1000/1000 [14:05<00:00,  1.18it/s]

1. Linear Regression

In [46]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = lr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = lr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on Linear Regression')
fig.show()

2. AdaBoostRegressor

In [47]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = abr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = abr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on AdaBoostRegressor')
fig.show()

3. GradientBoostingRegressor

In [48]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = gbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = gbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on GradientBoostingRegressor')
fig.show()

4. XGBRegressor

In [49]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = xgb_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = xgb_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on XGBRegressor')
fig.show()

5. CatBoostRegressor

In [50]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

6. RandomForestRegressor

In [51]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = rfr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = rfr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on RandomForestRegressor')
fig.show()
In [52]:
X = df.drop(['Y2'],axis=1)
Y = df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))
0.9944150509001013 0.9829129866383302
In [53]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')
In [54]:
temp_df = df.loc[df['Y2'] > 25]
temp_df
Out[54]:
X2 X3 X4 X5 X7 X8 Y1 Y2
4 563.5 318.5 122.5 7.0 0.0 0 20.84 28.28
5 563.5 318.5 122.5 7.0 0.0 0 21.46 25.38
6 563.5 318.5 122.5 7.0 0.0 0 20.71 25.16
7 563.5 318.5 122.5 7.0 0.0 0 19.68 29.60
8 588.0 294.0 147.0 7.0 0.0 0 19.50 27.30
... ... ... ... ... ... ... ... ...
739 637.0 343.0 147.0 7.0 0.4 1 40.79 44.87
740 661.5 416.5 122.5 7.0 0.4 1 38.82 39.37
741 661.5 416.5 122.5 7.0 0.4 1 39.72 39.80
742 661.5 416.5 122.5 7.0 0.4 1 39.31 37.79
743 661.5 416.5 122.5 7.0 0.4 1 39.86 38.18

368 rows × 8 columns

In [55]:
temp_df['X8'].value_counts()
Out[55]:
1    354
0     14
Name: X8, dtype: int64
In [56]:
temp_df.loc[temp_df['X8'] == 0]
Out[56]:
X2 X3 X4 X5 X7 X8 Y1 Y2
4 563.5 318.5 122.5 7.0 0.0 0 20.84 28.28
5 563.5 318.5 122.5 7.0 0.0 0 21.46 25.38
6 563.5 318.5 122.5 7.0 0.0 0 20.71 25.16
7 563.5 318.5 122.5 7.0 0.0 0 19.68 29.60
8 588.0 294.0 147.0 7.0 0.0 0 19.50 27.30
11 588.0 294.0 147.0 7.0 0.0 0 18.31 27.87
16 637.0 343.0 147.0 7.0 0.0 0 28.52 37.73
17 637.0 343.0 147.0 7.0 0.0 0 29.90 31.27
18 637.0 343.0 147.0 7.0 0.0 0 29.63 30.93
19 637.0 343.0 147.0 7.0 0.0 0 28.75 39.44
20 661.5 416.5 122.5 7.0 0.0 0 24.77 29.79
21 661.5 416.5 122.5 7.0 0.0 0 23.93 29.68
22 661.5 416.5 122.5 7.0 0.0 0 24.77 29.79
23 661.5 416.5 122.5 7.0 0.0 0 23.93 29.40

Here, we're going to revert the X8 values, because the binarization causes a large error when the model tries to predict Y2 (Cooling Load) values above 25.

In [57]:
df = pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx')
df
Out[57]:
X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
0 0.98 514.5 294.0 110.25 7.0 2 0.0 0 15.55 21.33
1 0.98 514.5 294.0 110.25 7.0 3 0.0 0 15.55 21.33
2 0.98 514.5 294.0 110.25 7.0 4 0.0 0 15.55 21.33
3 0.98 514.5 294.0 110.25 7.0 5 0.0 0 15.55 21.33
4 0.90 563.5 318.5 122.50 7.0 2 0.0 0 20.84 28.28
... ... ... ... ... ... ... ... ... ... ...
763 0.64 784.0 343.0 220.50 3.5 5 0.4 5 17.88 21.40
764 0.62 808.5 367.5 220.50 3.5 2 0.4 5 16.54 16.88
765 0.62 808.5 367.5 220.50 3.5 3 0.4 5 16.44 17.11
766 0.62 808.5 367.5 220.50 3.5 4 0.4 5 16.48 16.61
767 0.62 808.5 367.5 220.50 3.5 5 0.4 5 16.64 16.03

768 rows × 10 columns

We're going to remove X1 and X6, for the same reasons given earlier in this notebook.

In [58]:
df.drop(['X1','X6'],axis=1,inplace=True)
df
Out[58]:
X2 X3 X4 X5 X7 X8 Y1 Y2
0 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
1 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
2 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
3 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
4 563.5 318.5 122.50 7.0 0.0 0 20.84 28.28
... ... ... ... ... ... ... ... ...
763 784.0 343.0 220.50 3.5 0.4 5 17.88 21.40
764 808.5 367.5 220.50 3.5 0.4 5 16.54 16.88
765 808.5 367.5 220.50 3.5 0.4 5 16.44 17.11
766 808.5 367.5 220.50 3.5 0.4 5 16.48 16.61
767 808.5 367.5 220.50 3.5 0.4 5 16.64 16.03

768 rows × 8 columns

In [59]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))
0.977063447328585 0.9401598996120876
In [60]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

Scenario 1 : With X8 and without Y1 feature.

In [61]:
X = df.drop(['Y2'],axis=1)
Y = df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))
0.9961402649937874 0.9825615016523755
In [62]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

Scenario 2 : With X8 and Y1 features.

In [63]:
temp_df = df.drop(['X8'],axis=1)
temp_df
Out[63]:
X2 X3 X4 X5 X7 Y1 Y2
0 514.5 294.0 110.25 7.0 0.0 15.55 21.33
1 514.5 294.0 110.25 7.0 0.0 15.55 21.33
2 514.5 294.0 110.25 7.0 0.0 15.55 21.33
3 514.5 294.0 110.25 7.0 0.0 15.55 21.33
4 563.5 318.5 122.50 7.0 0.0 20.84 28.28
... ... ... ... ... ... ... ...
763 784.0 343.0 220.50 3.5 0.4 17.88 21.40
764 808.5 367.5 220.50 3.5 0.4 16.54 16.88
765 808.5 367.5 220.50 3.5 0.4 16.44 17.11
766 808.5 367.5 220.50 3.5 0.4 16.48 16.61
767 808.5 367.5 220.50 3.5 0.4 16.64 16.03

768 rows × 7 columns

In [64]:
X = temp_df.drop(['Y2'],axis=1)
Y = temp_df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))
0.9939310855310725 0.9892019517369207
In [65]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

Scenario 3 : Without X8 and with Y1 feature.

In [66]:
X = temp_df.drop(['Y1','Y2'],axis=1)
Y = temp_df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))
0.9710728626727535 0.9747745043013276
In [67]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

Scenario 4 : Without X8 and without Y1 feature.

In [68]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
temp_df = df.drop(['X1','X6'],axis=1)
temp_df
Out[68]:
X2 X3 X4 X5 X7 X8 Y1 Y2
0 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
1 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
2 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
3 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
4 563.5 318.5 122.50 7.0 0.0 0 20.84 28.28
... ... ... ... ... ... ... ... ...
763 784.0 343.0 220.50 3.5 0.4 5 17.88 21.40
764 808.5 367.5 220.50 3.5 0.4 5 16.54 16.88
765 808.5 367.5 220.50 3.5 0.4 5 16.44 17.11
766 808.5 367.5 220.50 3.5 0.4 5 16.48 16.61
767 808.5 367.5 220.50 3.5 0.4 5 16.64 16.03

768 rows × 8 columns

In [69]:
temp_df.loc[(temp_df['X8'] > 0), 'X8']=1
temp_df
Out[69]:
X2 X3 X4 X5 X7 X8 Y1 Y2
0 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
1 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
2 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
3 514.5 294.0 110.25 7.0 0.0 0 15.55 21.33
4 563.5 318.5 122.50 7.0 0.0 0 20.84 28.28
... ... ... ... ... ... ... ... ...
763 784.0 343.0 220.50 3.5 0.4 1 17.88 21.40
764 808.5 367.5 220.50 3.5 0.4 1 16.54 16.88
765 808.5 367.5 220.50 3.5 0.4 1 16.44 17.11
766 808.5 367.5 220.50 3.5 0.4 1 16.48 16.61
767 808.5 367.5 220.50 3.5 0.4 1 16.64 16.03

768 rows × 8 columns
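The `.loc` assignment above collapses every non-zero glazing-area-distribution code to 1, turning X8 into a binary "has a glazing distribution" flag (X8 = 0 corresponds to no glazing). A standalone sketch of the same transformation on a toy frame (`demo` is illustrative):

```python
import pandas as pd

# Collapse any non-zero X8 code to 1, leaving 0 as-is; this is
# equivalent to the .loc[temp_df['X8'] > 0, 'X8'] = 1 assignment above.
demo = pd.DataFrame({'X8': [0, 1, 3, 5, 0]})
demo['X8'] = (demo['X8'] > 0).astype(int)
print(demo['X8'].tolist())  # [0, 1, 1, 1, 0]
```

The `.astype(int)` form makes the binarisation a single vectorised expression rather than an in-place masked assignment; both produce the same column.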

In [70]:
X = temp_df.drop(['Y2'],axis=1)
Y = temp_df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))
0.9957190858593685 0.9721345451651117
In [71]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

Scenario 5: With modified X8 and with Y1 feature.¶

From the five scenarios above, Scenario 2 or Scenario 5 is preferable: in those two, only a few points fall away from the centre line, compared with the other three. With X8 and with Y1 as features, the Y2 predictions are good.¶

We can run the same check for the Y1 prediction, to decide whether X8 should be included there as well.¶

In [72]:
X = temp_df.drop(['Y1','Y2'],axis=1)
Y = temp_df['Y1']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))
0.9980566490700827 0.9974422531623282
In [73]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

Without X8, the Y1 prediction is also good, but X8 still carries some information through X7 (the two features have a correlation of about 0.2), so we include X8 for the Y1 prediction. We therefore build two separate models for better predictions:¶

1. Prediction of Y1 with the independent variables X2, X3, X4, X5, X7, and X8.¶

2. Prediction of Y2 with the independent variables X2, X3, X4, X5, X7, X8 and the dependent feature Y1.¶
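The two-model scheme above can be sketched as a chained regressor: one model predicts Y1 from the X features, and a second model predicts Y2 from the X features augmented with the (predicted) Y1. The class name `ChainedRegressor` and the toy `MeanModel` stand-in are illustrative, not the notebook's actual code; sklearn's `RegressorChain` offers similar behaviour out of the box.

```python
# Hedged sketch of the chained two-model design. Any regressor with
# fit/predict (e.g. CatBoostRegressor) can be plugged in for the toy model.

class ChainedRegressor:
    def __init__(self, model_y1, model_y2):
        self.model_y1 = model_y1
        self.model_y2 = model_y2

    def fit(self, X, y1, y2):
        self.model_y1.fit(X, y1)
        # Train the Y2 model with the *true* Y1 appended as an extra column.
        self.model_y2.fit([row + [t] for row, t in zip(X, y1)], y2)
        return self

    def predict(self, X):
        y1_hat = self.model_y1.predict(X)
        # At inference time only the *predicted* Y1 is available.
        y2_hat = self.model_y2.predict(
            [row + [p] for row, p in zip(X, y1_hat)])
        return y1_hat, y2_hat


class MeanModel:
    """Toy stand-in for a real regressor: always predicts the training mean."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean] * len(X)


chain = ChainedRegressor(MeanModel(), MeanModel()).fit(
    [[1.0], [2.0]], [10.0, 20.0], [5.0, 7.0])
y1_hat, y2_hat = chain.predict([[3.0]])
print(y1_hat, y2_hat)  # [15.0] [6.0]
```

Note the asymmetry: the Y2 model is trained on the true Y1 but must consume the predicted Y1 at inference time, so errors in the first model propagate into the second.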

In [74]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['Y1','Y2'],axis=1)
Y = df['Y1']
#x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(X, Y)
pred = cbr.predict(X)
print(r2_score(Y, pred))
visulaize_performance_of_the_model(pred, Y, 'CatBoost regressor')
0.9999073199670971
In [75]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['Y2'],axis=1)
Y = df['Y2']
#x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(X, Y)
pred = cbr.predict(X)
print(r2_score(Y, pred))
visulaize_performance_of_the_model(pred, Y, 'CatBoost regressor')
0.9992043779559775
In [76]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['X1','X6','Y1','Y2'],axis=1)
Y = df['Y1']
#x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(X, Y)
pred = cbr.predict(X)
print(r2_score(Y, pred))
visulaize_performance_of_the_model(pred, Y, 'CatBoost regressor')
0.9987571086021994
In [77]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['X1','X6','Y2'],axis=1)
Y = df['Y2']
#x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(X, Y)
pred = cbr.predict(X)
print(r2_score(Y, pred))
visulaize_performance_of_the_model(pred, Y, 'CatBoost regressor')
0.9962330440522129

The graphs above clearly show that dimensionality reduction leads to a performance drop. We can verify this by running the models again without removing anything from the dataset.¶

In [78]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['Y1','Y2'],axis=1)
Y = df.drop(['X1','X2','X3','X4','X5','X6','X7','X8'],axis=1)
lr_trn_score,rfr_trn_score,sgd_trn_score,en_trn_score,abr_trn_score,gbr_trn_score,svr_trn_score,xgb_trn_score,cbr_trn_score = [],[],[],[],[],[],[],[],[]
lr_test_score,rfr_test_score,sgd_test_score,en_test_score,abr_test_score,gbr_test_score,svr_test_score,xgb_test_score,cbr_test_score = [],[],[],[],[],[],[],[],[]
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
    
    lr = LinearRegression().fit(x_train, y_train)
    pred = lr.predict(x_test)
    pred_trn = lr.predict(x_train)
    lr_test_score.append(r2_score(y_test, pred))
    lr_trn_score.append(r2_score(y_train, pred_trn))
    
    sgd = MultiOutputRegressor(SGDRegressor()).fit(x_train,y_train)
    pred = sgd.predict(x_test)
    pred_trn = sgd.predict(x_train)
    sgd_test_score.append(r2_score(y_test, pred))
    sgd_trn_score.append(r2_score(y_train, pred_trn))
    
    en = ElasticNet().fit(x_train,y_train)
    pred = en.predict(x_test)
    pred_trn = en.predict(x_train)
    en_test_score.append(r2_score(y_test, pred))
    en_trn_score.append(r2_score(y_train, pred_trn))
    
    abr = MultiOutputRegressor(AdaBoostRegressor()).fit(x_train,y_train)
    pred = abr.predict(x_test)
    pred_trn = abr.predict(x_train)
    abr_test_score.append(r2_score(y_test, pred))
    abr_trn_score.append(r2_score(y_train, pred_trn))
    
    gbr = MultiOutputRegressor(GradientBoostingRegressor()).fit(x_train,y_train)
    pred = gbr.predict(x_test)
    pred_trn = gbr.predict(x_train)
    gbr_test_score.append(r2_score(y_test, pred))
    gbr_trn_score.append(r2_score(y_train, pred_trn))
    
    svr = MultiOutputRegressor(SVR()).fit(x_train,y_train)
    pred = svr.predict(x_test)
    pred_trn = svr.predict(x_train)
    svr_test_score.append(r2_score(y_test, pred))
    svr_trn_score.append(r2_score(y_train, pred_trn))
    
    xgb = MultiOutputRegressor(XGBRegressor()).fit(x_train,y_train)
    pred = xgb.predict(x_test)
    pred_trn = xgb.predict(x_train)
    xgb_test_score.append(r2_score(y_test, pred))
    xgb_trn_score.append(r2_score(y_train, pred_trn))
    
    cbr = MultiOutputRegressor(CatBoostRegressor(verbose=0)).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))
    
    
    rfr = RandomForestRegressor().fit(x_train, y_train)
    pred = rfr.predict(x_test)
    pred_trn = rfr.predict(x_train)  # use the random-forest model, not lr
    rfr_test_score.append(r2_score(y_test, pred))
    rfr_trn_score.append(r2_score(y_train, pred_trn))
100%|██████████| 1000/1000 [34:04<00:00,  2.04s/it] 
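Repeating `train_test_split` 1000 times with a fresh random split each iteration amounts to Monte Carlo (repeated random subsampling) cross-validation, so each model's list of per-split R² scores can be summarised by its mean and spread. A small sketch, assuming the score lists built in the loop above (the `summarise` helper and `demo_scores` values are illustrative):

```python
import statistics

# Summarise a list of per-split R^2 scores: the mean estimates expected
# performance, the std and min indicate how stable the model is across splits.
def summarise(scores):
    return {
        'mean': statistics.mean(scores),
        'std': statistics.stdev(scores),
        'min': min(scores),
    }

demo_scores = [0.995, 0.997, 0.993, 0.996]  # illustrative values
s = summarise(demo_scores)
print(s['mean'], s['std'], s['min'])
```

Applied to `cbr_test_score` and friends, this gives a more honest comparison between models than any single split's score.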

CatBoostRegressor¶

In [79]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

The R² score is around 0.98 when dimensionality reduction and feature engineering are applied, but without them it is above 0.995 on both the train and test scores.¶

In [80]:
print("Train Accuracy :",np.mean(cbr_trn_score)*100)
print("Test Accuracy :",np.mean(cbr_test_score)*100)
Train Accuracy : 99.94105122035963
Test Accuracy : 99.68085255927099
In [81]:
pred
Out[81]:
array([[36.2183 , 39.1086 ],
       [12.1999 , 14.9466 ],
       [36.5728 , 37.2743 ],
       [14.6368 , 17.0479 ],
       [24.2178 , 25.9339 ],
       [32.2173 , 33.1127 ],
       [36.1482 , 36.2495 ],
       [25.3945 , 26.8092 ],
       [14.2866 , 15.0643 ],
       [29.2729 , 30.569  ],
       [14.8952 , 15.576  ],
       [12.9644 , 15.6852 ],
       [26.2406 , 28.0722 ],
       [42.0551 , 42.334  ],
       [40.4267 , 39.8254 ],
       [36.6571 , 36.9522 ],
       [28.5724 , 31.5279 ],
       [14.4487 , 16.7673 ],
       [14.4393 , 17.0646 ],
       [23.9168 , 25.5309 ],
       [32.2052 , 35.2163 ],
       [12.3751 , 15.3781 ],
       [39.3341 , 43.0809 ],
       [29.4039 , 29.5464 ],
       [14.2608 , 17.0458 ],
       [29.0632 , 31.4638 ],
       [12.8261 , 15.9376 ],
       [36.468  , 36.9184 ],
       [12.9689 , 15.8419 ],
       [26.6294 , 29.3116 ],
       [10.3552 , 13.6203 ],
       [11.5032 , 14.1564 ],
       [29.1492 , 31.2337 ],
       [16.7129 , 20.1717 ],
       [32.5967 , 33.9185 ],
       [41.8504 , 41.3938 ],
       [29.3278 , 30.5144 ],
       [19.9712 , 25.3508 ],
       [11.3147 , 13.9994 ],
       [25.8034 , 29.9774 ],
       [17.0463 , 17.2059 ],
       [26.4909 , 27.1038 ],
       [12.9759 , 15.7958 ],
       [36.3722 , 39.7724 ],
       [12.4379 , 15.2596 ],
       [32.3041 , 32.8554 ],
       [19.4    , 22.6602 ],
       [14.5053 , 16.7478 ],
       [36.3896 , 37.273  ],
       [26.0264 , 29.3768 ],
       [28.7723 , 29.4696 ],
       [13.9855 , 16.055  ],
       [18.8343 , 21.8517 ],
       [11.1023 , 14.1537 ],
       [12.6922 , 14.2257 ],
       [35.959  , 36.6293 ],
       [ 6.39752, 11.4227 ],
       [12.3571 , 14.9618 ],
       [15.1152 , 18.169  ],
       [28.9095 , 30.4197 ],
       [29.5598 , 31.1404 ],
       [36.6148 , 37.0465 ],
       [ 7.1943 , 12.3373 ],
       [28.8042 , 30.7659 ],
       [12.7798 , 14.0862 ],
       [12.9721 , 15.9404 ],
       [15.1621 , 19.3701 ],
       [11.1977 , 14.1137 ],
       [14.9104 , 15.68   ],
       [16.5147 , 16.6574 ],
       [32.5907 , 32.8658 ],
       [14.5381 , 17.0505 ],
       [15.0914 , 18.1621 ],
       [12.4165 , 15.1511 ],
       [25.8077 , 29.6147 ],
       [14.4955 , 17.2472 ],
       [29.3345 , 29.7631 ],
       [35.2853 , 37.8541 ],
       [10.3908 , 13.609  ],
       [25.3517 , 26.5098 ],
       [12.2314 , 15.2061 ],
       [32.2517 , 33.359  ],
       [11.576  , 14.1769 ],
       [14.8518 , 15.6279 ],
       [25.7311 , 30.3504 ],
       [32.7338 , 34.2769 ],
       [24.2097 , 29.6783 ],
       [15.1874 , 19.247  ],
       [10.4127 , 13.6406 ],
       [11.1059 , 14.154  ],
       [ 7.2058 , 12.3342 ],
       [36.3363 , 36.8157 ],
       [32.4119 , 33.8526 ],
       [25.4973 , 27.6092 ],
       [11.5408 , 13.7549 ],
       [39.6898 , 40.22   ],
       [17.1522 , 17.2261 ],
       [35.6129 , 37.2382 ],
       [18.8684 , 22.0105 ],
       [23.877  , 25.8006 ],
       [15.2372 , 17.7978 ],
       [26.2085 , 28.2379 ],
       [24.6007 , 26.4439 ],
       [32.7914 , 34.3675 ],
       [11.217  , 14.3025 ],
       [32.4735 , 34.2448 ],
       [16.6723 , 16.1324 ],
       [11.2405 , 14.3703 ],
       [26.254  , 28.2222 ],
       [15.3025 , 19.3156 ],
       [32.6635 , 33.2931 ],
       [14.4026 , 14.9734 ],
       [11.1536 , 14.3173 ],
       [29.2478 , 31.0659 ],
       [12.7292 , 14.2678 ],
       [13.0975 , 15.614  ],
       [28.7251 , 32.0108 ],
       [29.2731 , 30.9188 ],
       [12.4526 , 15.2772 ],
       [16.8303 , 24.1076 ],
       [12.8109 , 16.0854 ],
       [15.166  , 19.3458 ],
       [29.4558 , 30.9142 ],
       [19.4516 , 24.8333 ],
       [12.8066 , 14.1722 ],
       [29.1944 , 30.9232 ],
       [29.086  , 29.7849 ],
       [16.9344 , 20.5372 ],
       [15.2522 , 19.2634 ],
       [28.7461 , 31.4444 ],
       [11.8923 , 14.6332 ],
       [16.4508 , 16.9661 ],
       [ 8.6474 , 12.2    ],
       [14.4671 , 15.3598 ],
       [31.9033 , 34.6088 ],
       [10.7261 , 14.048  ],
       [32.2099 , 34.0358 ],
       [32.3904 , 33.1941 ],
       [12.3671 , 15.2376 ],
       [ 6.40532, 11.5887 ],
       [16.4114 , 17.0544 ],
       [32.5872 , 34.1049 ],
       [15.1811 , 17.6493 ],
       [10.4097 , 13.5978 ],
       [24.268  , 26.0036 ],
       [32.576  , 33.2337 ],
       [39.3769 , 40.4777 ],
       [14.37   , 17.0741 ],
       [14.0379 , 16.1653 ],
       [28.6452 , 33.3159 ],
       [12.6679 , 15.6556 ],
       [29.2666 , 30.0071 ],
       [29.649  , 28.7731 ],
       [28.1403 , 33.8339 ]])
In [82]:
y_test
Out[82]:
Y1 Y2
112 35.65 41.07
414 12.10 15.57
256 37.03 34.99
561 14.70 17.00
194 24.04 26.18
... ... ...
199 29.79 29.92
466 12.67 15.83
148 28.07 34.14
393 29.40 32.93
151 29.05 29.67

154 rows × 2 columns

In [83]:
test_values = pd.DataFrame(y_test)
test_values.reset_index(drop=True,inplace=True)
test_values
Out[83]:
Y1 Y2
0 35.65 41.07
1 12.10 15.57
2 37.03 34.99
3 14.70 17.00
4 24.04 26.18
... ... ...
149 29.79 29.92
150 12.67 15.83
151 28.07 34.14
152 29.40 32.93
153 29.05 29.67

154 rows × 2 columns

In [84]:
result = pd.DataFrame(pred, columns = ['Predicted Y1', 'Predicted Y2'])
result
Out[84]:
Predicted Y1 Predicted Y2
0 36.2183 39.1086
1 12.1999 14.9466
2 36.5728 37.2743
3 14.6368 17.0479
4 24.2178 25.9339
... ... ...
149 28.6452 33.3159
150 12.6679 15.6556
151 29.2666 30.0071
152 29.6490 28.7731
153 28.1403 33.8339

154 rows × 2 columns

In [85]:
final_y1 = pd.merge(test_values['Y1'], result['Predicted Y1'], left_index=True,right_index=True)
final_y1
Out[85]:
Y1 Predicted Y1
0 35.65 36.2183
1 12.10 12.1999
2 37.03 36.5728
3 14.70 14.6368
4 24.04 24.2178
... ... ...
149 29.79 28.6452
150 12.67 12.6679
151 28.07 29.2666
152 29.40 29.6490
153 29.05 28.1403

154 rows × 2 columns

In [86]:
final_y2 = pd.merge(test_values['Y2'], result['Predicted Y2'], left_index=True,right_index=True)
final_y2
Out[86]:
Y2 Predicted Y2
0 41.07 39.1086
1 15.57 14.9466
2 34.99 37.2743
3 17.00 17.0479
4 26.18 25.9339
... ... ...
149 29.92 33.3159
150 15.83 15.6556
151 34.14 30.0071
152 32.93 28.7731
153 29.67 33.8339

154 rows × 2 columns
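With the actual and predicted columns aligned side by side as in `final_y2`, per-row errors fall out directly. A quick sketch on a toy frame (the `demo` values are illustrative, taken from the first two rows above):

```python
import pandas as pd

# With actual and predicted values in adjacent columns, the absolute
# error per row and its mean (MAE) are one-liners.
demo = pd.DataFrame({'Y2': [41.07, 15.57],
                     'Predicted Y2': [39.11, 14.95]})
abs_err = (demo['Y2'] - demo['Predicted Y2']).abs()
print(abs_err.mean())  # mean absolute error over the rows
```

A mean absolute error in load units (kWh/m²) is often easier to interpret for this dataset than R² alone.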

In [87]:
visulaize_performance_of_the_model(final_y2['Predicted Y2'], final_y2['Y2'], 'CatBoost regressor')  # (pred, actual), matching the earlier calls
In [88]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['Y2'],axis=1)
Y = df['Y2']
cbr_trn_score = []
cbr_test_score = []
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
        
    cbr = CatBoostRegressor(verbose=0).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))   
100%|██████████| 1000/1000 [15:17<00:00,  1.09it/s]
In [89]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()
In [90]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

In our project development, we will follow these rules:¶

1. Prediction of Y1 with the independent variables X1, X2, X3, X4, X5, X6, X7, and X8.¶

2. Prediction of Y2 with the independent variables X1, X2, X3, X4, X5, X6, X7, X8 and the dependent feature Y1. The reason for adding Y1 to the Y2 prediction is that the heating and cooling loads are strongly correlated, so a known (or predicted) Y1 carries useful information about Y2, as the scenario comparison above showed.¶